I've been searching for a generalized, non-parametric way to describe the data I analyze, and I've finally learned about the Dirichlet distribution. In this post I'm going to explore how one might take a limited amount of information and form a distribution without any prior knowledge or assumptions.

A key realization I've had is that I label distributions as "normal" or "exponential" far too easily. What if, instead of reaching for an overly constraining model, we leaned on probability and Bayesian statistics?

But how can we approach a problem this way? This post is the result of my own exploration of the subject... I wouldn't claim it's "correct" or that you should adopt it. I do think it's an interesting approach, though, and if you can point me to more information on it, I'd love to learn.

Let's start with a seemingly simple problem. We receive the following data (shown as a histogram)... What is our belief about what the next sampled number will be?


In [116]:
import numpy as np
import matplotlib.pyplot as plt

data = [1,1,5,5,5,7,7,7,7,7,7,9]
plt.hist(data)


Out[116]:
(array([ 2.,  0.,  0.,  0.,  0.,  3.,  0.,  6.,  0.,  1.]),
 array([ 1. ,  1.8,  2.6,  3.4,  4.2,  5. ,  5.8,  6.6,  7.4,  8.2,  9. ]),
 <a list of 10 Patch objects>)

To convert this into a probability distribution we have to make a decision. We've received zero observations of 0, 2, 3, 4, 6, 8, and 10. Does that mean we expect we'll NEVER see any of these numbers? What about 11? Or 15? Personally, I'm not so sure. Let's try something.

What if we set some finite range of values we're interested in? Let's say 0 through 18, 19 possible values in all. For the sake of argument, let's say we began our previous experiment assuming every one of those values is equally likely. We call this an uninformative prior. This is what that looks like as a histogram:


In [143]:
plt.axis((0,19,0,2))                      # fix the axes so each bar of height 1 is visible
plt.hist(range(0,19), bins=range(0,20))   # one "observation" of every value 0-18


Out[143]:
(array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
         1.,  1.,  1.,  1.,  1.,  1.]),
 array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17, 18, 19]),
 <a list of 19 Patch objects>)

With this in mind, let's add the data we observed to this extremely naive model:


In [144]:
hypothesis_count = 19
data = [1,1,5,5,5,7,7,7,7,7,7,9]
counts, bin_edges = np.histogram(data, bins=range(0,20))
data_histogram = [counts + 1, bin_edges]  # add one pseudo-count to every bin: our uniform prior
plt.bar(range(0,hypothesis_count), data_histogram[0])


Out[144]:
<Container object of 19 artists>

Now no number in the range we're considering has zero observations. That may seem odd at first, but as we gather more data we'll see it matter less and less. In the meantime, if we want to claim complete ignorance, this equal treatment is exactly what that claim looks like. Whether you actually want to pretend you're in complete ignorance is a more advanced topic, however.

Next let's take the data we've observed thus far and create a fuzzy distribution that gives us a bound on what this data may look like as we continue to observe new data. To do this, we'll need to use a distribution known as a Dirichlet distribution. How does it work?
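
Before digging in, here's a quick peek at what a single draw from a Dirichlet looks like: it's a vector of bucket probabilities that sums to 1, and buckets with larger counts tend to receive more of the mass. A tiny sketch with made-up counts (the 7, 4, 1 here are hypothetical, not our data):

import numpy as np

# Three hypothetical buckets with pseudo-counts 7, 4 and 1.
draw = np.random.dirichlet([7, 4, 1])
print(draw)        # e.g. [0.58, 0.33, 0.09] -- one plausible set of bucket probabilities
print(draw.sum())  # ~1.0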

Let's take a step back and look at our histogram in a slightly different way.

What if we thought of each observation as having some likelihood of falling into one of the discrete buckets we defined above? You might notice that 7 is the single most likely bucket, with 5 and 1 following behind. Now, again, does this mean we think there's a 0% chance of a 6 being observed? Probably not. 0% is a VERY strong statement. Translated into normal human speak, it means a 6 is impossible. That's rather closed-minded.

Instead we can model each individual bin as a Beta probability distribution. For example, the probability of observing a 0 is distributed as Beta(1, 30). Why? The 1 comes from our uninformative prior (we haven't actually observed a 0), and the 30 is the combined count everywhere else: our 12 real observations plus the 18 prior counts sitting in the other bins. This means that the distribution for the probability of observing a 0 looks like this:


In [11]:
sampled_data = np.random.beta(1, 30, size=1000)   # 1,000 draws from the Beta(1, 30) marginal
plt.hist(sampled_data, bins=np.arange(0, 1, step=.01))



So the probability of observing a value of zero is most likely very small, but it isn't pinned to exactly zero. A distribution like this exists for each of the buckets in our range of possibilities. In terms of a confidence (or credibility) interval, you might say there's a 95% chance that the true probability of observing a zero lies between 0% and 10%.
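
If you'd rather not simulate, that number can be read off analytically. Because the Beta(1, 30) density only decreases, its 95% highest-density interval runs from 0 up to its 95th percentile. A minimal sketch, assuming SciPy is available (it isn't used elsewhere in this post):

from scipy import stats

# 95th percentile of Beta(1, 30): the upper end of the credible range for
# the probability of observing a 0.
print(stats.beta.ppf(0.95, 1, 30))   # ~0.095, i.e. the 0-10% range described above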

What if we visualize this distribution for all of the possible observations in our distribution? We can do that!


In [188]:
options = []
samples = 10000
# Draw 10,000 plausible probability vectors from the Dirichlet defined by our counts.
sampled_sim = np.random.dirichlet(data_histogram[0], samples)
for entry in sampled_sim:
    for i in range(0,hypothesis_count):
        if len(options) <= i:
            options.append([])
        options[i].append(entry[i])
ci_histograms = []
for i in range(0,hypothesis_count):
    # Histogram this bin's sampled probabilities and sort the bars by height (densest first).
    sorted_data = sorted(zip(*np.histogram(options[i], bins=100)), key=lambda x: x[0], reverse=True)
    data_in_hdi = []
    total = 0
    # Accumulate the densest bars until they cover 95% of the samples.
    for count, bin_left in sorted_data:
        if total < int(samples*.95):
            total += count
            data_in_hdi.append(bin_left)
    minimum = min(data_in_hdi)
    maximum = max(data_in_hdi)
    ci_histograms.append((minimum, maximum))
plt.plot(ci_histograms)


Out[188]:
[<matplotlib.lines.Line2D at 0x11625ce10>,
 <matplotlib.lines.Line2D at 0x116091290>]

These are the 95% credibility intervals for our distribution, one per bin. As a sanity check, look at the probability of observing a value of 0: it looks like a 0-10% chance, which matches the manual computation we just performed for that case.

What is this graph telling us? The probability of observing a 0 is somewhere between 0% and 10%; that much we've discussed. What about observing a 1? Somewhere between roughly 2% and 20%. What about a 7? Somewhere around 10% to 37%. We now have a fuzzy distribution that sets some constraints on what we should believe. But the real magic is in how this shifts as we observe more data.

But wait, what is this crazy code I'm using to create this? I'm running simulations. That's probably not necessary since this is a fairly straightforward distribution, but I'm a better programmer than a mathematician.
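
For the curious, a non-simulation version is possible too: each marginal of a Dirichlet with concentration parameters alpha is Beta(alpha_i, alpha_0 - alpha_i), where alpha_0 is the sum of all the concentrations, so intervals for every bin can be read straight from the Beta distribution. A rough sketch, again assuming SciPy (it uses equal-tailed intervals, so the numbers will differ slightly from the density-based intervals simulated above):

import numpy as np
from scipy import stats

def marginal_intervals(counts, mass=0.95):
    # Each marginal of Dirichlet(counts) is Beta(alpha_i, alpha_0 - alpha_i);
    # return an equal-tailed interval covering `mass` for every bin at once.
    counts = np.asarray(counts, dtype=float)
    rest = counts.sum() - counts
    tail = (1 - mass) / 2
    return stats.beta.ppf(tail, counts, rest), stats.beta.ppf(1 - tail, counts, rest)

lows, highs = marginal_intervals(data_histogram[0])
plt.plot(list(zip(lows, highs)))   # comparable to the simulated intervals above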

Let's imagine we observe another 12 data points.


In [192]:
moar_data = [1,1,5,5,5,7,7,7,7,7,7,9, 2,7,3,5,7,9, 1,5,7,7,8,6]
plt.hist(moar_data)


Out[192]:
(array([  3.,   1.,   1.,   0.,   0.,   5.,   1.,  10.,   1.,   2.]),
 array([ 1. ,  1.8,  2.6,  3.4,  4.2,  5. ,  5.8,  6.6,  7.4,  8.2,  9. ]),
 <a list of 10 Patch objects>)

Adding in our uninformative prior like we did before...


In [193]:
hypothesis_count = 19
counts, bin_edges = np.histogram(moar_data, bins=range(0,20))
moar_data_histogram = [counts + 1, bin_edges]  # add-one prior again
plt.bar(range(0,hypothesis_count), moar_data_histogram[0])


Out[193]:
<Container object of 19 artists>

Notice how the values we defaulted to a single prior observation are shrinking in relative terms, since they still haven't actually been observed? Soon we'll add a LOT more data and watch their share get reduced dramatically.
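
Another way to see the prior fading is the expected probability of a still-unobserved value, which is just its single pseudo-count divided by the total count. A quick sketch (12 and 24 match the data sets above; 100 is a hypothetical larger sample):

# Expected probability of a value we've never observed, as the data grows.
for n_observed in [12, 24, 100]:
    total = n_observed + 19                    # real observations plus the 19 prior pseudo-counts
    print(n_observed, round(1.0 / total, 3))   # 0.032, 0.023, 0.008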

First, let's replot the confidence intervals for each bin and see if they've narrowed or widened.


In [195]:
options = []
samples = 10000
# Same simulation as before, now with the updated counts.
moar_sampled_sim = np.random.dirichlet(moar_data_histogram[0], samples)
for entry in moar_sampled_sim:
    for i in range(0,hypothesis_count):
        if len(options) <= i:
            options.append([])
        options[i].append(entry[i])
moar_ci_histograms = []
for i in range(0,hypothesis_count):
    sorted_data = sorted(zip(*np.histogram(options[i], bins=100)), key=lambda x: x[0], reverse=True)
    data_in_hdi = []
    total = 0
    for count, bin_left in sorted_data:
        if total < int(samples*.95):
            total += count
            data_in_hdi.append(bin_left)
    minimum = min(data_in_hdi)
    maximum = max(data_in_hdi)
    moar_ci_histograms.append((minimum, maximum))
plt.plot(moar_ci_histograms)


Out[195]:
[<matplotlib.lines.Line2D at 0x10877bc10>,
 <matplotlib.lines.Line2D at 0x10877be90>]

Ahhh! The intervals have narrowed. For example, look at the probability of observing a value of zero: now it's somewhere between 0% and 7% or so.

Next let's add a bunch more data and recreate our charts to see where we're at.


In [212]:
even_moar_data = [1,1,5,5,5,7,7,7,7,7,7,9, 2,7,3,5,7,9, 1,5,7,7,8,6, 7,7,7,7,7,7,7, 9,9,9,9,9,9,9, 6,6,6,6, 5,5,5,5,5,5, 6,6,6, 8,8, 2,2,2,2, 3,3,3,3, 4,4,4]
plt.hist(even_moar_data, bins=range(0,19))


Out[212]:
(array([  0.,   3.,   5.,   5.,   3.,  11.,   8.,  17.,   3.,   9.,   0.,
          0.,   0.,   0.,   0.,   0.,   0.,   0.]),
 array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17, 18]),
 <a list of 18 Patch objects>)

In [213]:
hypothesis_count = 19
counts, bin_edges = np.histogram(even_moar_data, bins=range(0,20))
even_moar_data_histogram = [counts + 1, bin_edges]  # add-one prior again
plt.bar(range(0,hypothesis_count), even_moar_data_histogram[0])


Out[213]:
<Container object of 19 artists>

In [214]:
options = []
samples = 10000
# Same simulation again with the larger data set.
even_moar_sampled_sim = np.random.dirichlet(even_moar_data_histogram[0], samples)
for entry in even_moar_sampled_sim:
    for i in range(0,hypothesis_count):
        if len(options) <= i:
            options.append([])
        options[i].append(entry[i])
even_moar_ci_histograms = []
for i in range(0,hypothesis_count):
    sorted_data = sorted(zip(*np.histogram(options[i], bins=100)), key=lambda x: x[0], reverse=True)
    data_in_hdi = []
    total = 0
    for count, bin_left in sorted_data:
        if total < int(samples*.95):
            total += count
            data_in_hdi.append(bin_left)
    minimum = min(data_in_hdi)
    maximum = max(data_in_hdi)
    even_moar_ci_histograms.append((minimum, maximum))
plt.plot(even_moar_ci_histograms)


Out[214]:
[<matplotlib.lines.Line2D at 0x1088cc890>,
 <matplotlib.lines.Line2D at 0x1088ccb10>]

There we go. When we first began, every number we hadn't observed had an upper bound of roughly 10% on its probability of being observed. Now it's under 5%.
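
As a rough analytic cross-check: an unobserved bin's marginal is Beta(1, b), where b is the combined count in every other bin, and setting its CDF 1 - (1 - x)^b equal to 0.95 gives an upper bound of 1 - 0.05^(1/b). Plugging in the three data set sizes used so far:

# 95% upper bound for a value that still has only its single prior pseudo-count.
for label, n_observed in [("original data", 12), ("moar data", 24), ("even moar data", 64)]:
    b = n_observed + 19 - 1                       # every bin's count except this one's
    print(label, round(1 - 0.05 ** (1.0 / b), 3))
# original data 0.095, moar data 0.069, even moar data 0.036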


In [220]:
final_data = even_moar_data + [int(np.floor(np.random.normal(5, 1))) for _ in range(1000)]
plt.hist(final_data, bins=range(0,19))


Out[220]:
(array([   0.,    4.,   27.,  140.,  360.,  323.,  153.,   42.,    6.,
           9.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.]),
 array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17, 18]),
 <a list of 18 Patch objects>)

In [221]:
hypothesis_count = 19
counts, bin_edges = np.histogram(final_data, bins=range(0,20))
final_data_histogram = [counts + 1, bin_edges]  # add-one prior again
plt.bar(range(0,hypothesis_count), final_data_histogram[0])


Out[221]:
<Container object of 19 artists>

In [222]:
options = []
samples = 10000
# Same simulation once more with the final counts.
final_sampled_sim = np.random.dirichlet(final_data_histogram[0], samples)
for entry in final_sampled_sim:
    for i in range(0,hypothesis_count):
        if len(options) <= i:
            options.append([])
        options[i].append(entry[i])
final_ci_histograms = []
for i in range(0,hypothesis_count):
    sorted_data = sorted(zip(*np.histogram(options[i], bins=100)), key=lambda x: x[0], reverse=True)
    data_in_hdi = []
    total = 0
    for count, bin_left in sorted_data:
        if total < int(samples*.95):
            total += count
            data_in_hdi.append(bin_left)
    minimum = min(data_in_hdi)
    maximum = max(data_in_hdi)
    final_ci_histograms.append((minimum, maximum))
plt.plot(final_ci_histograms)


Out[222]:
[<matplotlib.lines.Line2D at 0x10faac5d0>,
 <matplotlib.lines.Line2D at 0x10faac850>]

After 1,000 more samples (drawn from a normal distribution and rounded down), we can see that our chance of observing a zero is practically nil, as is our chance of observing values of 10 and above. Also, values of 4 and 5 are now the most likely, though 7 is still quite possible.

At this sample size we can see how this Bayesian approach to modeling data can recover even the distributions treated as standard within frequentist statistics.

That's it. We've got a general model that can describe whatever data we have, with no real assumptions about its shape beyond the finite range of values we chose up front and the rules of probability.
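
To close the loop on the original question of what we believe the next sampled number will be: the posterior predictive for the next observation is just the smoothed counts normalized, or equivalently a Dirichlet draw followed by a draw from that probability vector. A small sketch using the final histogram above:

# Probability that the NEXT observation lands in each bin, under the final posterior.
predictive = final_data_histogram[0] / float(np.sum(final_data_histogram[0]))
plt.bar(range(0, hypothesis_count), predictive)

# Or simulate one future observation directly.
probs = np.random.dirichlet(final_data_histogram[0])
next_value = np.random.choice(np.arange(hypothesis_count), p=probs)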

I wouldn't say this is a rigorous treatment; it's more my own exploration of the subject. Still, I think it's an interesting way of coming at the problem without prior assumptions. In a future post I'm hoping to explore the differences between two distributions modeled in this way. The ultimate goal is to figure out a non-parametric Bayesian test for differences between two distributions.